feat: datumctl compute plugin — deploy and manage workloads from the CLI#113
scotwells wants to merge 30 commits into
a63c87a to c1186cb
Adds the datumctl-compute plugin binary with commands for deploying and managing containerized workloads on Datum Cloud via the developer CLI.
Commands:
- deploy: create or update a workload from flags or a manifest file
- destroy: delete a workload and clean up its revision history
- status: show health, placement summary, and recent revision info
- instances: list and describe running instances across cities
- scale: adjust minimum replica count across placements
- rollout: watch live progress, view history, and roll back revisions
- restart: trigger a rolling restart of a workload or specific city
- quota: inspect per-city instance usage and quota headroom
Closes #98. Depends on datum-cloud/datumctl#198.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Within a project's virtual control plane, all resources live in the
"default" namespace — the project slug is only used to route to the
right control plane URL. Updated all commands to use
util.ResourceNamespace ("default") instead of the project name as the
k8s namespace.
Also corrects the instance type default from "d1-standard-2" to
"datumcloud/d1-standard-2" to match the format the admission webhook
requires.
Discovered while testing against the staging environment.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The datumctl module requirement was upgrading controller-runtime to v0.23.3, which broke compatibility with multicluster-runtime and milo. Eliminated the dependency by:
- Inlining the --plugin-manifest protocol in main.go
- Reading DATUM_API_HOST and DATUM_CREDENTIALS_HELPER from env directly in util/client.go instead of via plugin.Context()/plugin.Token()
- Reading DATUM_ORG from env in root.go instead of via plugin.NewRootCmd
- Dropping the now-unreachable internal/cmd/compute/client.go
Also updates CI workflows to use go-version-file instead of a pinned go 1.24.0, and bumps golangci-lint to v2.12.2, which supports go 1.25.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
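A minimal sketch of reading the plugin's settings straight from the environment, as the commit above describes. Only the three variable names come from the commit; the `pluginEnv` struct, the `loadPluginEnv` function, and the choice to treat the host and credentials helper as required are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// pluginEnv is a hypothetical holder for the values the commit says are
// read from the environment instead of via the plugin SDK.
type pluginEnv struct {
	APIHost           string
	CredentialsHelper string
	Org               string
}

// loadPluginEnv reads the variables through a lookup function (os.Getenv in
// real use) so the logic can be exercised without touching the process env.
// Treating DATUM_API_HOST and DATUM_CREDENTIALS_HELPER as required is an
// assumption of this sketch.
func loadPluginEnv(getenv func(string) string) (pluginEnv, error) {
	env := pluginEnv{
		APIHost:           getenv("DATUM_API_HOST"),
		CredentialsHelper: getenv("DATUM_CREDENTIALS_HELPER"),
		Org:               getenv("DATUM_ORG"),
	}
	if env.APIHost == "" {
		return env, errors.New("DATUM_API_HOST is not set")
	}
	if env.CredentialsHelper == "" {
		return env, errors.New("DATUM_CREDENTIALS_HELPER is not set")
	}
	return env, nil
}

func main() {
	env, err := loadPluginEnv(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		return
	}
	fmt.Println("API host:", env.APIHost, "org:", env.Org)
}
```

Injecting the lookup function keeps the env contract in one place and makes it testable with a plain map.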
Upgrades controller-runtime from v0.21.0 to v0.23.3 and multicluster-runtime from v0.21.0-alpha.8 to v0.23.3, which unblocks adding go.datum.net/datumctl as a direct dependency.
The CLI plugin (datumctl-compute) now uses the official datumctl plugin SDK:
- plugin.ServeManifest() for the --plugin-manifest protocol
- plugin.NewRootCmd() for pre-wired org/project/output flags
- plugin.Context() and plugin.Token() for credential access
Controller breaking changes addressed: ClusterName distinct type, Watches callback signature, NewWebhookManagedBy generic API.
A local milo provider fork is added at internal/provider/milo since the upstream package hasn't been updated for the ClusterName type change.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addresses 63 lint findings across errcheck, goconst, gocyclo, gofmt, prealloc, staticcheck, and unparam linters:
- gofmt/goimports: reformat cmd/main.go, deploy.go, util/client.go, webhook
- errcheck: assign discarded fmt.Fprint* and Flush returns to _
- staticcheck: update webhook to generic admission.Defaulter[T]/Validator[T] with WithDefaulter/WithValidator; fix SA4010 unused append in quota.go; remove redundant .ObjectMeta selectors in restart.go
- unparam: rename four never-used function parameters to _
- gocyclo: extract helpers from watch.Rollout and quota.runQuota to reduce cyclomatic complexity below threshold
- goconst: extract repeated string literals to named constants across controllers, validation, and tests
- prealloc: preallocate slices with known capacity in validation and tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- errcheck: fix unchecked fmt.Fprint* returns in deploy, quota, rollout, scale
- prealloc: preallocate allErrs in workload_validation.go and stateful test
- gofmt: reformat destroy.go, instances.go, rollout.go
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- golangci.yml: exclude errcheck for internal/cmd/*, since ignoring write errors on stdout/stderr is idiomatic in CLI tools
- prealloc: preallocate allErrs in validateScaleSettingMetrics
- gofmt: reformat status.go, instance_controller_test.go
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire ValidArgsFunction on every command that accepts a workload name (deploy, destroy, restart, rollout, rollout history, rollout undo, scale, status) and register flag completion for instances --workload. All completions call a shared CompleteWorkloadNames helper in internal/cmd/compute/util that fetches live workload names from the API and always returns ShellCompDirectiveNoFileComp so the shell never falls back to filename completion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove ValidArgsFunction from deploy and replace with util.CompleteWorkloadNamesAndFlags, which wraps CompleteWorkloadNames with plugin.WithFlagCompletion from the datumctl SDK.
- Add plugin.WithFlagCompletion to the datumctl plugin SDK so any plugin can get the same behaviour by wrapping their own ValidArgsFunction.
- Bump go.datum.net/datumctl to b44de1c (adds WithFlagCompletion).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove the hardcoded datum-control-plane ClusterIssuer from the csi-webhook-cert component. DNS names stay since they are fixed by the service name and namespace. Each consuming overlay now supplies the issuer via a strategic merge patch, allowing different environments to use different cert issuers without forking the component. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cert issuer name is environment-specific configuration that belongs in the infra repo, not the compute overlay. The infra repo's base manager patch already owns the full webhook-server-tls volume definition including the issuer. Consumers deploying outside infra must patch the issuer in their own overlay. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a printer.go with PrintJSON and PrintYAML helpers that commands can use to emit API resources as structured output. Extend completion.go with CompleteInstanceNames, CompleteCityCodes, and CompleteOutputFormats so all -o/--output, --city, and instance-name completions are driven from a single shared source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
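A rough sketch of what a `PrintJSON`-style helper could look like using only the standard library. The signature, the two-space indentation, and the `instanceRow` type are assumptions; the real printer.go may differ, and `PrintYAML` is omitted because it would need a third-party YAML package.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// instanceRow is a hypothetical stand-in for an API resource.
type instanceRow struct {
	Name string `json:"name"`
	City string `json:"city"`
}

// PrintJSON writes obj to w as indented JSON followed by a newline.
func PrintJSON(w io.Writer, obj any) error {
	enc := json.NewEncoder(w)
	enc.SetIndent("", "  ")
	return enc.Encode(obj)
}

// renderJSON is a small convenience wrapper that returns the output as a string.
func renderJSON(obj any) (string, error) {
	var buf bytes.Buffer
	if err := PrintJSON(&buf, obj); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	rows := []instanceRow{{Name: "web-0", City: "DFW"}}
	if err := PrintJSON(os.Stdout, rows); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
	}
}
```

Taking an `io.Writer` rather than writing to stdout directly lets every command share the helper and keeps it easy to test.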
Both commands now accept -o/--output with tab-completion. json/yaml emit the underlying API resource (InstanceList) or structured quota rows respectively. wide adds an INSTANCE TYPE column for instances. --no-headers suppresses the header row for table and wide. City completion is wired to CompleteCityCodes and instance describe gains tab-completion via CompleteInstanceNames. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
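The table, wide, and --no-headers behaviour described above could be sketched with `text/tabwriter` as follows. The INSTANCE TYPE column comes from the commit; the other column names, the `instance` struct, and `printInstances` are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"text/tabwriter"
)

// instance is a hypothetical row type for this sketch.
type instance struct {
	Name, City, Status, InstanceType string
}

// printInstances renders rows as an aligned table. wide appends the
// INSTANCE TYPE column; noHeaders suppresses the header row.
func printInstances(w io.Writer, items []instance, wide, noHeaders bool) error {
	tw := tabwriter.NewWriter(w, 0, 8, 2, ' ', 0)
	if !noHeaders {
		header := "NAME\tCITY\tSTATUS"
		if wide {
			header += "\tINSTANCE TYPE"
		}
		fmt.Fprintln(tw, header)
	}
	for _, it := range items {
		line := it.Name + "\t" + it.City + "\t" + it.Status
		if wide {
			line += "\t" + it.InstanceType
		}
		fmt.Fprintln(tw, line)
	}
	return tw.Flush()
}

// renderInstances returns the table as a string for inspection.
func renderInstances(items []instance, wide, noHeaders bool) string {
	var buf bytes.Buffer
	if err := printInstances(&buf, items, wide, noHeaders); err != nil {
		return ""
	}
	return buf.String()
}

func main() {
	items := []instance{{"web-0", "DFW", "Running", "datumcloud/d1-standard-2"}}
	if err := printInstances(os.Stdout, items, true, false); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
	}
}
```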
Add datumctl compute workloads (list) and workloads describe <name> commands. The list command shows NAME/HEALTH/READY/PLACEMENTS/IMAGE/AGE columns with --health and --city filters, -o table|wide|json|yaml, and a footer summary. The describe command replaces status with a unified config+health view: header block, per-placement per-city ready counts with inline degradation annotations, and a container spec block. Remove the now-redundant status command from root.go and delete its package. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix duplicate TYPE/INSTANCE TYPE columns in instances -o wide (W3): populate TYPE from runtimeKind (sandbox/vm), INSTANCE TYPE from instType
- Fix footer bucketing in instances list (W4): compute Running/Pending/Failed from actual status strings instead of hardcoding Failed=0
- Skip revision ConfigMap Gets in workloads list table mode (W5): only fetch per-workload revision when -o wide is requested, avoiding N round-trips on every list invocation
- Compute health footer tallies after filters are applied (W9): previously counted all workloads then printed a filtered subset, making the summary misleading when --health or --city filters were active
- Fix gofmt import ordering in workloads.go (B1)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
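The footer fixes (W4, W9) come down to two rules: filter first, then tally from the actual status strings. A sketch under assumed types follows; `instance`, `tally`, `filterByCity`, and `tallyStatuses` are illustrative names, and matching Running/Pending by prefix is an assumption modelled on the "Failed" prefix check mentioned later in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// instance is a hypothetical row; Status holds the rendered status string.
type instance struct {
	Name, City, Status string
}

// tally holds the footer counts.
type tally struct {
	Running, Pending, Failed int
}

// filterByCity keeps only rows in the given city; an empty city keeps all.
func filterByCity(items []instance, city string) []instance {
	if city == "" {
		return items
	}
	out := make([]instance, 0, len(items))
	for _, it := range items {
		if it.City == city {
			out = append(out, it)
		}
	}
	return out
}

// tallyStatuses buckets rows by status prefix instead of hardcoding counts.
// Statuses that match no prefix are left out of every bucket.
func tallyStatuses(items []instance) tally {
	var t tally
	for _, it := range items {
		switch {
		case strings.HasPrefix(it.Status, "Running"):
			t.Running++
		case strings.HasPrefix(it.Status, "Pending"):
			t.Pending++
		case strings.HasPrefix(it.Status, "Failed"):
			t.Failed++
		}
	}
	return t
}

func main() {
	all := []instance{
		{"web-0", "DFW", "Running"},
		{"web-1", "DFW", "Failed (crashing)"},
		{"web-2", "LHR", "Pending (SourceNotFound)"},
	}
	// Tally the filtered slice so the footer matches the rows printed.
	shown := filterByCity(all, "DFW")
	fmt.Printf("%+v\n", tallyStatuses(shown))
}
```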
Before creating a workload, the deploy command now checks whether the required network(s) exist. If a network is missing, the user is offered the option to create a minimal auto-IPAM network in-place rather than hitting an opaque NetworkNotFound error post-submission. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… API
- Add EnsureComputeEntitlement to gate all compute commands on an active service entitlement; prompts TTY users to request access and surfaces approval status
- Rewrite quota command to query AllowanceBucket resources from the project VCP (milo-system namespace) instead of deriving usage from instance quota conditions
- Add NewPlatformClient targeting the platform API server for ResourceRegistration lookups
- Extract ListServiceQuota into util so other service plugins can reuse the quota display logic with their own resource type prefix and display metadata overrides
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace hand-rolled HTTP entitlement code with a proper client-go implementation using go.miloapis.com/service-catalog types. Uses client.WithWatch to stream events from the API server and unblocks as soon as the Ready condition appears — no polling interval. Also adds ASCII progress bar to quota table output. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
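The ASCII progress bar in the quota table might look something like the sketch below. The `progressBar` name, the `#`/`-` characters, and the clamping behaviour are all assumptions; the commit does not show the actual rendering.

```go
package main

import (
	"fmt"
	"strings"
)

// progressBar renders used/limit as a fixed-width ASCII bar.
// Usage is clamped to [0, limit]; a non-positive limit renders an empty bar.
func progressBar(used, limit int64, width int) string {
	if width <= 0 {
		return "[]"
	}
	filled := 0
	if limit > 0 {
		if used < 0 {
			used = 0
		}
		if used > limit {
			used = limit
		}
		filled = int(used * int64(width) / limit)
	}
	return "[" + strings.Repeat("#", filled) + strings.Repeat("-", width-filled) + "]"
}

func main() {
	fmt.Println(progressBar(3, 10, 10), "3/10")
}
```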
The compute CLI client now serializes network-services-operator types (Network, NetworkBinding, SubnetClaim), so deploy can preflight and create networks on the user's behalf. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Deployment revisions are becoming a platform concept rather than a client concern. Remove the ConfigMap-backed revision ledger the CLI maintained per workload, along with the 'rollout history' and 'rollout undo' subcommands and the revision column in 'workloads'. 'rollout' remains as a live-progress watch. This also removes the only code path that serialized core/v1 ConfigMaps from the CLI, so the missing-corev1-scheme warning on deploy no longer occurs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cheduling base
After rebasing onto feat/federated-deployment-scheduling, go.mod had picked up the wrong versions of two deps via conflict resolution:
- go.datum.net/network-services-operator was left at v0.1.0 (from #113's old go.mod side) instead of v0.21.10-... required by HEAD's LocationBinding usage
- go.miloapis.com/service-catalog v0.0.0-20260527221104 transitively requires milo v0.26.1, which has a broken downstreamclient (Apply method missing, ClusterName type mismatch).
Add a replace directive to pin milo to v0.25.2 (the version used by the federated-scheduling base) so downstreamclient compiles cleanly. service-catalog is updated to the latest available version.
Also apply gofmt alignment fixes surfaced by the rebase on instance_controller.go.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… resolution
The first conflict resolution in the aa9dc15 commit accidentally truncated workload_webhook.go, dropping the ValidateCreate method, its kubebuilder marker, and producing a syntactically invalid Default function body (extra brace + wrong return signature). Restore the file to match 5486adf's content (the authoritative post-lint-migration version).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
c1186cb to 8c15212
The platform now stamps city-code, workload-name, workload-deployment-name, and placement-name directly onto Instances at creation time. The CLI can therefore resolve CITY/WORKLOAD/placement directly from those labels without performing cross-object joins. The prior approach keyed the WorkloadDeployment map on UID and looked up instances via WorkloadDeploymentUIDLabel. That UID is the edge/Karmada WD UID, which differs from the project-cluster WD UID, causing the join to fail across federation planes and producing "unknown"/"orphaned" output. The new label-first path reads CityCodeLabel, WorkloadNameLabel, PlacementNameLabel, and WorkloadDeploymentNameLabel (name is identical across all planes) before falling back to the WD Get/List join. A wdNameFromInstanceName helper strips the trailing ordinal suffix from the Instance name as a last-resort fallback for instances created before the labels existed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
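A sketch of the label-first lookup with the name-based last resort. The label key below is a placeholder (the value of the real WorkloadDeploymentNameLabel constant is not shown in this PR), `resolveWDName` is a hypothetical wrapper, and the intermediate WD Get/List join is left out for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// Placeholder key; the real WorkloadDeploymentNameLabel value is not shown here.
const workloadDeploymentNameLabel = "example.test/workload-deployment-name"

// isDigits reports whether s is non-empty and consists only of ASCII digits.
func isDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// wdNameFromInstanceName strips a trailing "-<ordinal>" suffix from an
// Instance name. Names without a numeric suffix are returned unchanged.
func wdNameFromInstanceName(name string) string {
	i := strings.LastIndex(name, "-")
	if i <= 0 || !isDigits(name[i+1:]) {
		return name
	}
	return name[:i]
}

// resolveWDName prefers the label and falls back to the name heuristic.
// (The real code tries a WD Get/List join between these two steps.)
func resolveWDName(instanceName string, labels map[string]string) string {
	if v := labels[workloadDeploymentNameLabel]; v != "" {
		return v
	}
	return wdNameFromInstanceName(instanceName)
}

func main() {
	fmt.Println(resolveWDName("web-dfw-0", nil))
}
```

Because the deployment name is identical across planes, both paths yield a key that is safe to join on, unlike the UID.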
The `compute deploy` rollout watcher reported PHASE=Done and exited within seconds of creating the workload, before any instances were scheduled. A WorkloadDeployment's Status.DesiredReplicas stays at zero until the controller first reconciles it, and computePhase treated zero desired as Done — so the very first poll of a fresh deployment looked complete. Resolve the wait target from the spec minimum while the controller has not yet reported a desired count, and require that no stale replicas remain before reporting Done so scale-downs and rolling updates aren't declared complete while old instances are still draining. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
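The fix can be sketched as two small functions. Only `computePhase`, `Status.DesiredReplicas`, the spec minimum, and the "Done" phase come from the commit; the `deploymentState` struct, its other field names, `waitTarget`, and the "Progressing" phase are assumptions.

```go
package main

import "fmt"

// deploymentState is a hypothetical flattened view of a WorkloadDeployment.
type deploymentState struct {
	SpecMinReplicas int32 // minimum replicas from the spec
	DesiredReplicas int32 // status value; zero until the first reconcile
	ReadyReplicas   int32 // up-to-date replicas that are ready
	StaleReplicas   int32 // old replicas still draining
}

// waitTarget uses the controller's desired count once reported and
// falls back to the spec minimum before the first reconcile.
func waitTarget(s deploymentState) int32 {
	if s.DesiredReplicas > 0 {
		return s.DesiredReplicas
	}
	return s.SpecMinReplicas
}

// computePhase reports Done only when the target is met and no stale
// replicas remain, so fresh deployments and scale-downs are not
// declared complete early.
func computePhase(s deploymentState) string {
	if s.ReadyReplicas >= waitTarget(s) && s.StaleReplicas == 0 {
		return "Done"
	}
	return "Progressing"
}

func main() {
	fresh := deploymentState{SpecMinReplicas: 2}
	fmt.Println(computePhase(fresh))
}
```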
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
8bc1efb to 685e353
Consume the server-side status-blocking-reason contract: each resource's readiness condition (Instance/Ready, WorkloadDeployment/Available, Workload/Available) now carries a machine-readable reason and human message when not True.
- Add ReadinessBlock helper in util/conditions.go: given a condition list and type, returns (reason, message, blocked) with no per-reason branching. This is the single reusable entry-point for the new contract.
- InstanceStatus (list view): falls through to "Pending (<reason>)" from the Ready condition when no specific sub-condition check matches, replacing the bare "Pending" for unknown causes like SourceNotFound or ReferencedDataNotReady.
- InstanceStatusDetail (describe view): falls through to "Pending — <reason>" with the message as detail, replacing "Unknown" for those same causes.
- WorkloadHealth: surfaces the reason from Available when false, e.g. "Unavailable — SourceNotFound" instead of the generic message.
- degradedAnnotation (workloads describe per-city line): rewritten to read the WorkloadDeployment's own Available condition; removes the per-instance List fetch and the quota/InstanceStatusDetail special-casing that was its only logic.
- printBlockedDetail (rollout watch): rewritten to read the deployment's Available condition; removes the per-instance List fetch entirely.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
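A minimal sketch of a `ReadinessBlock`-style helper. The local `Condition` type stands in for the Kubernetes condition type so the example stays self-contained; treating a missing condition as "not blocked" and the `instanceStatus` wrapper are assumptions.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes status condition.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// ReadinessBlock returns the reason and message of the named condition
// when it is present and not True. There is no per-reason branching.
// A missing condition is reported as not blocked (an assumption here).
func ReadinessBlock(conds []Condition, condType string) (reason, message string, blocked bool) {
	for _, c := range conds {
		if c.Type != condType {
			continue
		}
		if c.Status == "True" {
			return "", "", false
		}
		return c.Reason, c.Message, true
	}
	return "", "", false
}

// instanceStatus shows the list-view fallthrough described above.
func instanceStatus(conds []Condition) string {
	if reason, _, blocked := ReadinessBlock(conds, "Ready"); blocked {
		return "Pending (" + reason + ")"
	}
	return "Running"
}

func main() {
	conds := []Condition{{Type: "Ready", Status: "False", Reason: "SourceNotFound", Message: "image source missing"}}
	fmt.Println(instanceStatus(conds))
}
```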
…rovisioning status
The Programmed condition starts as Unknown (not False) while programming is in progress, so the ConditionFalse-only checks were bypassed and the raw ProgrammingInProgress reason leaked through the Ready condition fallback. Widen the checks to status != True to cover both Unknown and False states.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
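The difference between the old and new check is easy to see side by side. The type and function names here are illustrative, not the plugin's own.

```go
package main

import "fmt"

// ConditionStatus mirrors the three Kubernetes condition states.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// notProgrammedOld is the narrow check: it misses Unknown.
func notProgrammedOld(s ConditionStatus) bool {
	return s == ConditionFalse
}

// notProgrammed is the widened check: anything other than True counts.
func notProgrammed(s ConditionStatus) bool {
	return s != ConditionTrue
}

func main() {
	for _, s := range []ConditionStatus{ConditionTrue, ConditionFalse, ConditionUnknown} {
		fmt.Printf("%-7s old=%v new=%v\n", s, notProgrammedOld(s), notProgrammed(s))
	}
}
```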
Add three provider-emitted reason constants to the API types and map them to plain-English STATUS strings in the list and describe views:
ImageUnavailable → Failed (image unavailable)
InstanceCrashing → Failed (crashing)
ConfigurationError → Failed (configuration error)
Rename the PendingProgramming/ProgrammingInProgress cases from the misleading "network provisioning" to "Starting", which accurately describes the transient state without implying network work is involved.
Failed statuses are already counted in the "N Failed" summary line via the existing strings.HasPrefix(status, "Failed") check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
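The mapping above could be expressed as a single switch. The reason names and STATUS strings are taken from the commit; `statusForReason`, `countFailed`, the assumption that PendingProgramming and ProgrammingInProgress arrive as plain reason strings, and the generic "Pending (<reason>)" default are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// statusForReason maps a readiness reason to the STATUS string shown to users.
func statusForReason(reason string) string {
	switch reason {
	case "ImageUnavailable":
		return "Failed (image unavailable)"
	case "InstanceCrashing":
		return "Failed (crashing)"
	case "ConfigurationError":
		return "Failed (configuration error)"
	case "PendingProgramming", "ProgrammingInProgress":
		return "Starting"
	default:
		// Generic fallthrough for reasons with no dedicated wording.
		return "Pending (" + reason + ")"
	}
}

// countFailed counts statuses for the "N Failed" summary via the prefix check.
func countFailed(statuses []string) int {
	n := 0
	for _, s := range statuses {
		if strings.HasPrefix(s, "Failed") {
			n++
		}
	}
	return n
}

func main() {
	reasons := []string{"ImageUnavailable", "ProgrammingInProgress", "SourceNotFound"}
	statuses := make([]string, 0, len(reasons))
	for _, r := range reasons {
		statuses = append(statuses, statusForReason(r))
	}
	fmt.Println(statuses, countFailed(statuses), "Failed")
}
```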
📋 Real-world UX issue from a user enabling compute
Heads up: we got a user report that surfaces a confusing first-run experience with the enablement flow, and I've traced it end-to-end via the staging audit logs. Sharing here since the fix touches this plugin.
What the user saw:
From their perspective this looks like a flat-out failure. In reality, their first attempt succeeded: compute was enabled.
What actually happened (from the audit trail):
Why it matters for the product: the very first thing a new user does is turn compute on, and today that happy path can look broken even when it worked. The error message also leaks an internal resource name (
Proposed fix (branch
…and treat a
Happy to fold this into this PR or send it as a follow-up, whichever you prefer. 🙏
Summary
Adds the datumctl compute plugin so developers can deploy and manage containerized workloads on Datum Cloud directly from the CLI.
Commands shipped:
- deploy: push a container image as a workload with flags or a manifest file; waits for rollout
- destroy: tear down a workload with a confirmation prompt
- status: show workload health, per-city placement summary, and the active revision
- instances: list all running instances across cities, with describe for full detail
- scale: adjust minimum replica count across all placements
- rollout: watch live rollout progress, browse revision history, and roll back to any prior revision
- restart: trigger a rolling restart of a workload or a specific city
- quota: inspect per-city instance usage and surface quota-exceeded messages
Revision history is stored as a ConfigMap per workload so rollout history and rollout undo work without server-side tracking.
Dependencies
go.mod currently uses a replace directive pointing at that PR's worktree; the directive should be removed and replaced with a release tag once that PR merges.
What's not included
- logs: telemetry service not yet implemented
- cities / instance-types resource listing commands
Related
Closes #98. Design proposal in #111.